Fix time source mismatch in client RPC path after #3268 migration by hjwsm1989 · Pull Request #3283 · apache/brpc

hjwsm1989 · 2026-04-25T14:13:11Z

已按 @chenBright 在 #issuecomment-4319830604 的建议改成"回退 LocalityAwareLoadBalancer 这一侧的改动"，比迁移整条调用链更局部、更保守。

这个 PR 解决什么问题

#3268（"Use monotonic time instead of wall time"）把
LocalityAwareLoadBalancer::Weight::Update 的 end_time_us 与
LocalityAwareLoadBalancer::Describe 的 now 切到了
butil::cpuwide_time_us()，但所有产生 CallInfo::begin_time_us 的调用
方还在用 butil::gettimeofday_us()：

Channel::CallMethod（channel.cpp:451）→ Controller::IssueRPC →
Controller::Call::begin_time_us → SelectIn::begin_time_us →
CallInfo::begin_time_us
Controller::OnVersionedRPCReturned 中的 backup-request 与普通重试入口
（controller.cpp:672, 715）都是 IssueRPC(butil::gettimeofday_us())

时钟域错配导致：

const int64_t end_time_us = butil::cpuwide_time_us();         // 单调时钟，约 5e6
const int64_t latency = end_time_us - ci.begin_time_us;       // - 墙上时钟（约 1.7e15）
                                                              // ≈ -1.7e15 us
if (latency <= 0) {
    // time skews, ignore the sample.
    return 0;
}

每个 sample 都被 time-skew 短路丢掉。_time_q 永远拿不到新条目，
_avg_latency 一直是 0，locality-aware 反馈机制实际被静默关掉。

下游可观测的现象：冷启动场景下，list:// channel + lb=la + 2 个
backend、max_retry=1，第一个 backend 不通时，retry 偶发返回
EHOSTDOWN（Fail to select server from list://...），即使第二个
backend 是健康的。

Bisect / 复现

我在内部 fork 上对 brpc 1.16.1 与 1.17.0-rc2 之间的 51 个 commit 做了
bisect（master 代码不变，只换 brpc 版本），每一步在 2-backend
list:// + lb=la + max_retry=1 下跑探测 RPC 重复 500 次：

步骤	commit	失败
good	`c41e838a` Release 1.16.1	0/500
...	（4 个 GOOD 步骤）	...
#47	`771de31e` Fix UAF in LatencyRecorder	0/500
#48	`12fb539a` Use monotonic time (#3268)	25/500
#49	`b223e60b` Fix RDMA resource	30/500
bad	`604dad0c` 1.17.0-rc2	28/500

在 12fb539a 上做单行隔离实验：只把

 int64_t LocalityAwareLoadBalancer::Weight::Update(const CallInfo& ci, ...) {
-    const int64_t end_time_us = butil::cpuwide_time_us();
+    const int64_t end_time_us = butil::gettimeofday_us();

撤回（保留 #3268 其余所有改动），测试就回到 0/500 PASS。这把回归精确钉
到了那一行 end_time_us 切换。

改动内容

按 @chenBright 的建议，把 LocalityAwareLoadBalancer 一侧从 #3268 引入
的 cpuwide 改动回退，让它继续使用 gettimeofday_us()，与现有调用方对
齐：

文件	行号	改动
`src/brpc/policy/locality_aware_load_balancer.cpp`	21	include 注释回退到 `gettimeofday_us`
`src/brpc/policy/locality_aware_load_balancer.cpp`	379	`Weight::Update::end_time_us` 回退
`src/brpc/policy/locality_aware_load_balancer.cpp`	527	`Describe::now` 回退

Channel::CallMethod 与 Controller::OnVersionedRPCReturned 中那些
butil::gettimeofday_us() 调用都不用动；Controller::_begin_time_us、
Controller::latency_us() 等公开 API 的墙上时钟语义保持不变。

测试

新增 test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source，
不依赖 Server / Channel 的 LB 单元测试。它驱动若干次
SelectServer + Feedback 循环，begin_time_us 用
butil::gettimeofday_us()（与 Channel::CallMethod 当前行为一致），并
断言 _avg_latency 反映了真实的过期时间，而不是被 time-skew 短路卡在
0。

如果将来有人再把 Weight::Update::end_time_us 切走，或者某个
CallInfo::begin_time_us 的调用方漂到了别的时钟域，这个测试都会失败。

不起 Server / 不依赖网络，CI 上是确定性结果（本地 ~16ms）。
la_sanity 与 la_selection_too_long 也都仍然通过。

Check List

没有公开 API 变更。
新增测试 la_records_latency_with_consistent_time_source。
向后兼容：Controller::_begin_time_us / latency_us() 保持墙上
时钟语义不变。

hjwsm1989 · 2026-04-25T14:13:34Z

Hi @chenBright — could you take a look when you have time? This PR completes the time-source migration started in #3268; the bisect in the description pinpoints LocalityAwareLoadBalancer::Weight::Update's end_time_us (switched to cpuwide_time_us() in #3268) being inconsistent with the wall-clock CallInfo::begin_time_us plumbed in from Channel::CallMethod and the retry sites in Controller::OnVersionedRPCReturned. Thanks!

chenBright · 2026-04-25T14:30:20Z

I think it would be more appropriate to roll back the changes to LocalityAwareLoadBalancer.

hjwsm1989 · 2026-04-25T14:40:02Z

I think it would be more appropriate to roll back the changes to LocalityAwareLoadBalancer.

ok

PR apache#3268 ("Use monotonic time instead of wall time") switched LocalityAwareLoadBalancer::Weight::Update's end_time_us and LocalityAwareLoadBalancer::Describe's now to butil::cpuwide_time_us(), but every caller that supplies CallInfo::begin_time_us still uses butil::gettimeofday_us(): - Channel::CallMethod (channel.cpp:451) -> Controller::IssueRPC -> Controller::Call::begin_time_us -> SelectIn::begin_time_us -> CallInfo::begin_time_us - Controller::OnVersionedRPCReturned retry sites (controller.cpp:672, 715) call IssueRPC(gettimeofday_us()) on backup-request and regular retries The mismatched time domains make latency = end_time_us - ci.begin_time_us = cpuwide_now - wallclock_begin ~= -1.7e15 us trigger the `if (latency <= 0) { /* time skews, ignore */ return 0; }` short-circuit on every call. _time_q never accumulates samples, _avg_latency stays at 0, and locality-aware weight feedback is silently disabled. Visible downstream symptom: cold-start `list://` channels with `lb=la` and 2 backends occasionally fail RPCs with EHOSTDOWN ("Fail to select server from list://...") on retry even when one backend is healthy. Bisected reproduction in xsky/brpc fork: - 51 commit range c41e838..604dad0c (1.16.1 .. 1.17.0-rc2) - master code + LA-driven multipath probe at 2 backends, max_retry=1, repeat 500x: * commit 771de31 (one before apache#3268): 0/500 fail * commit 12fb539 (apache#3268): 25/500 fail * commit 12fb539 + revert only Weight::Update::end_time_us to gettimeofday_us: 0/500 fail This commit reverts the LA-side of apache#3268's clock change so the LB lines up with its existing callers again. Channel::CallMethod and the retry paths in Controller stay on butil::gettimeofday_us(), which preserves the wall-clock semantics of Controller::_begin_time_us / Controller::latency_us() that public users rely on. Adds test/brpc_load_balancer_unittest.cpp::la_records_latency_with_consistent_time_source which drives a series of SelectServer + Feedback cycles against LocalityAwareLoadBalancer (no Server / Channel needed) and asserts that _avg_latency reflects the elapsed time, rather than being stuck at 0 because of a time-source mismatch.

hjwsm1989 · 2026-04-25T14:48:35Z

@chenBright 谢谢 review，已按你的建议改成回退 LocalityAwareLoadBalancer 一侧的 #3268 改动（Weight::Update::end_time_us、Describe::now 与 include 注释），不动 Channel::CallMethod / Controller::OnVersionedRPCReturned 的调用方，这样 Controller::_begin_time_us / latency_us() 的墙上时钟语义保持原样。

新增的 la_records_latency_with_consistent_time_source 测试也对应调整：直接用 butil::gettimeofday_us() 作为 CallInfo::begin_time_us（匹配现有 caller），断言 _avg_latency > 0 —— 如果将来再有人把 Weight::Update 的时钟切走，或者哪个 caller 漂到别的时钟域，都会被这个测试抓到。

force-pushed 到 2c285ff2，PR description 也同步更新到了新的 revert 思路下。再麻烦你看一下，谢谢！

wwbmmm · 2026-04-26T02:30:01Z

LGTM

chenBright

LGTM

hjwsm1989 force-pushed the fix-la-time-source-mismatch branch from 9a6dda3 to 2c285ff Compare April 25, 2026 14:47

chenBright approved these changes Apr 26, 2026

View reviewed changes

chenBright merged commit 6ddefce into apache:master Apr 26, 2026
31 of 33 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix time source mismatch in client RPC path after #3268 migration#3283

Fix time source mismatch in client RPC path after #3268 migration#3283
chenBright merged 1 commit intoapache:masterfrom
hjwsm1989:fix-la-time-source-mismatch

hjwsm1989 commented Apr 25, 2026 •

edited

Loading

Uh oh!

hjwsm1989 commented Apr 25, 2026 •

edited

Loading

Uh oh!

chenBright commented Apr 25, 2026

Uh oh!

hjwsm1989 commented Apr 25, 2026

Uh oh!

hjwsm1989 commented Apr 25, 2026

Uh oh!

wwbmmm commented Apr 26, 2026

Uh oh!

chenBright left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hjwsm1989 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

这个 PR 解决什么问题

Bisect / 复现

改动内容

测试

Check List

Uh oh!

hjwsm1989 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chenBright commented Apr 25, 2026

Uh oh!

hjwsm1989 commented Apr 25, 2026

Uh oh!

hjwsm1989 commented Apr 25, 2026

Uh oh!

wwbmmm commented Apr 26, 2026

Uh oh!

chenBright left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hjwsm1989 commented Apr 25, 2026 •

edited

Loading

hjwsm1989 commented Apr 25, 2026 •

edited

Loading